Capstone project: Providing data-driven suggestions for HR

Table of Contents¶

  • Pace: Plan
    • Imports
    • Data Exploration (Initial EDA and data cleaning)
  • pAce: Analyze Stage
    • Data Visualization and EDA
  • paCe: Construct Stage
    • Model Building
    • Baseline Models
    • Feature Engineering (Round One)
    • Feature Engineering (Round Two)
    • Model Evaluation Results
  • pacE: Execute Stage
    • Results and Evaluation

Description and deliverables¶

This capstone project is an opportunity for you to analyze a dataset and build predictive models that can provide insights to the Human Resources (HR) department of a large consulting firm.

Upon completion, you will have two artifacts to present to future employers. One is a brief, one-page summary of this project that you would present to external stakeholders as the data professional at Salifort Motors. The other is the complete code notebook provided here. Drawing on your prior coursework, select one approach to the project question: use either a regression model or a machine learning model to predict whether or not an employee will leave the company.

In your deliverables, you will include the model evaluation (and interpretation if applicable), a data visualization(s) of your choice that is directly related to the question you ask, ethical considerations, and the resources you used to troubleshoot and find answers or solutions.

PACE stages¶


Pace: Plan¶

Back to top

Consider the questions in your PACE Strategy Document to reflect on the Plan stage.

In this stage, consider the following:

Understand the business scenario and problem¶

The HR department at Salifort Motors wants to take some initiatives to improve employee satisfaction levels at the company. They collected data from employees, but now they don't know what to do with it. They turn to you, a data analytics professional, and ask you to provide data-driven suggestions based on your understanding of the data. They have the following question: what's likely to make an employee leave the company?

Your goals in this project are to analyze the data collected by the HR department and to build a model that predicts whether or not an employee will leave the company.

If you can predict employees likely to quit, it might be possible to identify factors that contribute to their leaving. Because it is time-consuming and expensive to find, interview, and hire new employees, increasing employee retention will be beneficial to the company.

Familiarize yourself with the HR dataset¶

The dataset that you'll be using in this lab contains 14,999 rows and 10 columns for the variables listed below.

Note: you don't need to download any data to complete this lab. For more information about the data, refer to its source on Kaggle.

Variable Description
satisfaction_level Employee-reported job satisfaction level [0–1]
last_evaluation Score of employee's last performance review [0–1]
number_project Number of projects employee contributes to
average_monthly_hours Average number of hours employee worked per month
time_spend_company How long the employee has been with the company (years)
Work_accident Whether or not the employee experienced an accident while at work
left Whether or not the employee left the company
promotion_last_5years Whether or not the employee was promoted in the last 5 years
Department The employee's department
salary The employee's salary tier (low, medium, or high)

💭

Reflect on these questions as you complete the plan stage.¶

  • Who are your stakeholders for this project?
  • What are you trying to solve or accomplish?
  • What are your initial observations when you explore the data?
  • What resources do you find yourself using as you complete this stage? (Make sure to include the links.)
  • Do you have any ethical considerations in this stage?

Stakeholders:
The primary stakeholder is the Human Resources (HR) department, as they will use the results to inform retention strategies. Secondary stakeholders include C-suite executives who oversee company direction, managers implementing day-to-day retention efforts, employees (whose experiences and outcomes are directly affected), and, indirectly, customers—since employee satisfaction can impact customer satisfaction.

Project Goal:
The objective is to build a predictive model to identify which employees are likely to leave the company. The model should be interpretable so HR can design targeted interventions to improve retention, rather than simply flagging at-risk employees without actionable insights.

Initial Data Observations:

  • The workforce displays moderate satisfaction and generally high performance reviews.
  • Typical tenure is 3–4 years, with most employees (98%) not promoted recently.
  • Workplace accidents are relatively rare (14%).
  • Most employees are in lower salary bands and concentrated in sales, technical, and support roles.
  • About 24% of employees have left the company.
  • No extreme outliers, though a few employees have unusually long tenures or high monthly hours.

Resources Used:

  • Data dictionary
  • pandas documentation
  • matplotlib documentation
  • seaborn documentation
  • scikit-learn documentation
  • Kaggle HR Analytics Dataset

Ethical Considerations:

  • Ensure employee data privacy and confidentiality throughout the analysis.
  • Avoid introducing or perpetuating bias in model predictions (e.g., not unfairly targeting specific groups).
  • Maintain transparency in how predictions are generated and how they will be used in HR decision-making.

Imports¶

Back to top

  • Import packages
  • Load dataset

Import packages¶

Load dataset¶

pandas is used to read a dataset called HR_capstone_dataset.csv. As shown in this cell, the dataset has been automatically loaded for you. You do not need to download the .csv file or write additional code to access the dataset and proceed with this lab. Continue with this activity by completing the instructions below.

satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years Department salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low
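For readers running this outside the lab, the load step can be sketched as follows. The `read_csv` call is commented out because the file is pre-loaded in the lab environment; the small stand-in frame below mirrors the schema (including the source's `average_montly_hours` misspelling) purely for illustration.

```python
import pandas as pd

# In the lab the dataset is pre-loaded; locally you would read it like this:
# df0 = pd.read_csv("HR_capstone_dataset.csv")

# Tiny stand-in with the same schema, for illustration only:
df0 = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "last_evaluation": [0.53, 0.86, 0.88],
    "number_project": [2, 5, 7],
    "average_montly_hours": [157, 262, 272],  # misspelling is in the source data
    "time_spend_company": [3, 6, 4],
    "Work_accident": [0, 0, 0],
    "left": [1, 1, 1],
    "promotion_last_5years": [0, 0, 0],
    "Department": ["sales", "sales", "sales"],
    "salary": ["low", "medium", "medium"],
})
print(df0.head())
```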

Data Exploration (Initial EDA and data cleaning)¶

Back to top

  • Understand your variables
  • Clean your dataset (missing data, redundant data, outliers)

Gather basic information about the data¶

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   Department             14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB

Gather descriptive statistics about the data¶

Department value counts and percent:
              Count  Percent
Department                 
sales         4140    27.60
technical     2720    18.13
support       2229    14.86
IT            1227     8.18
product_mng    902     6.01
marketing      858     5.72
RandD          787     5.25
accounting     767     5.11
hr             739     4.93
management     630     4.20

Salary value counts and percent:
         Count  Percent
salary                
low      7316    48.78
medium   6446    42.98
high     1237     8.25
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610 0.238083 0.021268
std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719 0.425924 0.144281
min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000 0.000000 0.000000
25% 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000 0.000000 0.000000
50% 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000 0.000000 0.000000
75% 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000
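The count/percent tables above can be reproduced with `value_counts`. This is a minimal sketch on a small stand-in `salary` Series, not the notebook's exact cell:

```python
import pandas as pd

# Stand-in salary column (10 employees) to illustrate the pattern:
salary = pd.Series(["low"] * 5 + ["medium"] * 4 + ["high"] * 1, name="salary")

counts = salary.value_counts()                                  # absolute counts
percent = salary.value_counts(normalize=True).mul(100).round(2) # share in percent
summary = pd.DataFrame({"Count": counts, "Percent": percent})
print(summary)
```

The same two-column pattern produces the Department table; `df.describe()` yields the numeric summary shown above.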

Observations from descriptive statistics¶

  • satisfaction_level: Employee job satisfaction scores range from 0.09 to 1.0, with an average of about 0.61. The distribution is fairly wide (std ≈ 0.25), suggesting a mix of satisfied and dissatisfied employees.
  • last_evaluation: Performance review scores are generally high (mean ≈ 0.72), ranging from 0.36 to 1.0, with most employees scoring above 0.56.
  • number_project: Employees typically work on 2 to 7 projects, with a median of 4 projects.
  • average_monthly_hours: The average employee works about 201 hours per month, with a range from 96 to 310 hours, indicating some employees work significantly more than others.
  • time_spend_company: Most employees have been with the company for 2 to 10 years, with a median of 3 years. There are a few long-tenure employees (up to 10 years), but most are around 3–4 years.
  • Work_accident: About 14% of employees have experienced a workplace accident.
  • left: About 24% of employees have left the company (mean ≈ 0.24), so roughly one in four employees in the dataset is a leaver.
  • promotion_last_5years: Very few employees (about 2%) have been promoted in the last five years.
  • department: The largest departments are sales, technical, and support, which together account for over half of the workforce. Other departments are notably smaller.
  • salary: Most employees are in the low (49%) or medium (43%) salary bands, with only a small proportion (8%) in the high salary band.

Summary:
The data shows a workforce with moderate satisfaction, generally high performance reviews, and a typical tenure of 3–4 years. Most employees have not been promoted recently, and workplace accidents are relatively uncommon. Most employees are in lower salary bands and concentrated in sales, technical, and support roles. There is a notable proportion of employees who have left. There are no extreme outliers, but a few employees have unusually long tenures or high monthly hours.

Rename columns¶

As a data cleaning step, rename the columns as needed. Standardize the column names so that they are all in snake_case, correct any column names that are misspelled, and make column names more concise as needed.

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'Department', 'salary'],
      dtype='object')
Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_monthly_hours', 'tenure', 'work_accident', 'left',
       'promotion_last_5years', 'department', 'salary'],
      dtype='object')
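The before/after column indexes above correspond to a `rename` call along these lines (a sketch; the mapping matches the two printed indexes):

```python
import pandas as pd

# Empty frame carrying only the original column names:
df = pd.DataFrame(columns=[
    "satisfaction_level", "last_evaluation", "number_project",
    "average_montly_hours", "time_spend_company", "Work_accident",
    "left", "promotion_last_5years", "Department", "salary",
])

# Fix the misspelling, shorten a long name, and enforce snake_case:
df = df.rename(columns={
    "average_montly_hours": "average_monthly_hours",
    "time_spend_company": "tenure",
    "Work_accident": "work_accident",
    "Department": "department",
})
print(df.columns.tolist())
```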

Check missing values¶

Check for any missing values in the data.

satisfaction_level       0
last_evaluation          0
number_project           0
average_monthly_hours    0
tenure                   0
work_accident            0
left                     0
promotion_last_5years    0
department               0
salary                   0
dtype: int64

Check duplicates¶

Check for any duplicate entries in the data.

3008
satisfaction_level last_evaluation number_project average_monthly_hours tenure work_accident left promotion_last_5years department salary
396 0.46 0.57 2 139 3 0 1 0 sales low
866 0.41 0.46 2 128 3 0 1 0 accounting low
1317 0.37 0.51 2 127 3 0 1 0 sales medium
1368 0.41 0.52 2 132 3 0 1 0 RandD low
1461 0.42 0.53 2 142 3 0 1 0 sales low

There are 3,008 duplicate rows in the dataset. Since it is highly improbable for two employees to have identical responses across all columns, these duplicate entries should be removed from the analysis.
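The duplicate check and removal follow the standard pandas pattern, sketched here on a three-row stand-in frame with one exact duplicate:

```python
import pandas as pd

# Stand-in frame with one exact duplicate row:
df = pd.DataFrame({
    "satisfaction_level": [0.46, 0.46, 0.80],
    "tenure": [3, 3, 6],
    "left": [1, 1, 0],
})

n_dupes = df.duplicated().sum()        # counts repeats after the first occurrence
df1 = df.drop_duplicates(keep="first") # keep one copy of each row
print(n_dupes, df1.shape)
```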

<class 'pandas.core.frame.DataFrame'>
Index: 11991 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     11991 non-null  float64
 1   last_evaluation        11991 non-null  float64
 2   number_project         11991 non-null  int64  
 3   average_monthly_hours  11991 non-null  int64  
 4   tenure                 11991 non-null  int64  
 5   work_accident          11991 non-null  int64  
 6   left                   11991 non-null  int64  
 7   promotion_last_5years  11991 non-null  int64  
 8   department             11991 non-null  object 
 9   salary                 11991 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.0+ MB
None
satisfaction_level last_evaluation number_project average_monthly_hours tenure work_accident left promotion_last_5years department salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low

Check outliers¶

Check for outliers in the data.

Number of tenure outliers: 824
Outliers percentage of total: 6.87%

Certain types of models are more sensitive to outliers than others. When you get to the stage of building your model, consider whether to remove outliers, based on the type of model you decide to use.
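The outlier count above is consistent with the common 1.5×IQR rule. A minimal sketch on a stand-in tenure series (the notebook applies the same idea to the real `tenure` column):

```python
import pandas as pd

# Stand-in tenure values; 10 is well outside the bulk of the data:
tenure = pd.Series([2, 3, 3, 3, 4, 4, 5, 10])

q1, q3 = tenure.quantile(0.25), tenure.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = tenure[(tenure < lower) | (tenure > upper)]
print(f"Number of tenure outliers: {len(outliers)}")
```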

pAce: Analyze Stage¶

Back to top

  • Perform EDA (analyze relationships between variables)

💭

Reflect on these questions as you complete the analyze stage.¶

  • What did you observe about the relationships between variables?
  • What do you observe about the distributions in the data?
  • What transformations did you make with your data? Why did you choose to make those decisions?
  • What are some purposes of EDA before constructing a predictive model?
  • What resources do you find yourself using as you complete this stage? (Make sure to include the links.)
  • Do you have any ethical considerations in this stage?


Two Distinct Populations of Leavers:
There are two major groups of employees who left the company:

  • Underworked and Dissatisfied: These employees had low satisfaction and worked fewer hours on fewer projects. They may have been fired; alternatively, they may have given notice, or mentally checked out, and been assigned less work as a result.
  • Overworked and Miserable: These employees had low satisfaction but were assigned a high number of projects (6–7) and worked 250–300 hours per month. Notably, 100% of employees with 7 projects left.

Employees working on 3–4 projects generally stayed. Most groups worked more than a typical 40-hour workweek.

Attrition is highest at the 4–5 year mark, with a sharp drop-off in departures after 5 years. This suggests a critical window for retention efforts. Employees who make it past 5 years are much more likely to stay.
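An attrition-rate-by-tenure table like the one behind this observation can be computed with a simple groupby. This sketch uses a small stand-in frame with the cleaned column names, not the real data:

```python
import pandas as pd

# Stand-in frame: `left` is 1 for leavers, 0 for stayers
df = pd.DataFrame({
    "tenure": [3, 3, 4, 4, 5, 5, 6, 7],
    "left":   [0, 0, 1, 0, 1, 1, 0, 0],
})

# Mean of a 0/1 column per group = attrition rate per tenure year
rate_by_tenure = df.groupby("tenure")["left"].mean()
print(rate_by_tenure)
```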

Both leavers and stayers tend to have similar evaluation scores, though some employees with high evaluations still leave—often those who are overworked. This suggests that strong performance alone does not guarantee retention if other factors (like satisfaction or workload) are problematic.

Relationships Between Variables:

  • Satisfaction level is the strongest predictor of attrition. Employees who left had much lower satisfaction than those who stayed.
  • Number of projects and average monthly hours show a non-linear relationship: both underworked and overworked employees are more likely to leave, while those with a moderate workload tend to stay.
  • Employee evaluation (last performance review) has a weaker relationship with attrition compared to satisfaction or workload.
  • Tenure shows a moderate relationship with attrition: employees are most likely to leave at the 4–5 year mark, with departures dropping sharply after 5 years.
  • Promotion in the last 5 years is rare, and lack of promotion is associated with higher attrition.
  • Department and salary have only minor effects on attrition compared to satisfaction and workload.
  • Work accidents are slightly associated with lower attrition, possibly due to increased support after an incident.

Distributions in the Data:

  • Most variables (satisfaction, evaluation, monthly hours) are broadly distributed, with some skewness.
  • Tenure is concentrated around 3–4 years, with few employees beyond 5 years.
  • Number of projects is typically 3–4, but a small group has 6–7 projects (most of whom left).
  • Salary is heavily skewed toward low and medium bands.
  • There are no extreme outliers, but a few employees have unusually high tenure or monthly hours.

Data Transformations:

  • Renamed columns to standardized, snake_case format for consistency and easier coding.
  • Removed duplicate rows (about 3,000) to ensure each employee is only represented once.
  • Checked for and confirmed absence of missing values to avoid bias or errors in analysis.
  • Explored outliers but did not remove them at this stage, as their impact will be considered during modeling.

Purposes of EDA Before Modeling:

  • Understand the structure, quality, and distribution of the data.
  • Identify key variables and relationships that may influence attrition.
  • Detect and address data quality issues (duplicates, missing values, outliers).
  • Inform feature selection and engineering for modeling.
  • Ensure assumptions for modeling (e.g., independence, lack of multicollinearity) are reasonable.

Resources Used:

  • pandas documentation
  • matplotlib documentation
  • seaborn documentation
  • scikit-learn documentation
  • Kaggle HR Analytics Dataset

Ethical Considerations:

  • Ensure employee data privacy and confidentiality.
  • Avoid introducing or perpetuating bias in analysis or modeling.
  • Be transparent about how findings and predictions will be used.
  • Consider the impact of recommendations on employee well-being and fairness.

Note:
This data is clearly synthetic—it's too clean, and the clusters in the charts are much neater than what you’d see in real-world HR data.

Data Visualization and EDA¶

Back to top

Begin by understanding how many employees left and what percentage of all employees this figure represents.

Count Percent
left
Stayed 10000 83.4
Left 1991 16.6
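The stayed/left summary above can be built from the `left` column like so, shown here on a small stand-in Series:

```python
import pandas as pd

# Stand-in `left` column: five stayers, one leaver
left = pd.Series([0] * 5 + [1] * 1, name="left")

summary = pd.DataFrame({
    "Count": left.value_counts(),
    "Percent": left.value_counts(normalize=True).mul(100).round(1),
}).rename(index={0: "Stayed", 1: "Left"})
print(summary)
```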

Data visualizations¶

Now, examine variables that you're interested in, and create plots to visualize relationships between variables in the data.

I'll start with an overview of everything at once, then show individual plots.


The leavers fall into a few subgroups (the absolutely miserable and overworked, the dissatisfied and underworked, and those who presumably leave in the normal course of things). Violin plots will be more informative than box plots for showing these.
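A violin plot for one of these comparisons can be sketched as below. This is illustrative only: it uses a tiny stand-in frame and a hypothetical title, not the notebook's actual figure.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import pandas as pd
import seaborn as sns

# Stand-in frame: satisfaction for three stayers and three leavers
df = pd.DataFrame({
    "left": [0, 0, 0, 1, 1, 1],
    "satisfaction_level": [0.7, 0.8, 0.6, 0.1, 0.4, 0.9],
})

ax = sns.violinplot(data=df, x="left", y="satisfaction_level")
ax.set_xticklabels(["Stayed", "Left"])
ax.set_title("Satisfaction level by attrition status")
```

The same pattern extends to monthly hours and number of projects, which is where the bimodal leaver shape shows most clearly.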


Normalized above, true count below.


Two big clusters of leavers: one absolutely miserable group that worked very long hours, and one mildly dissatisfied clump that worked under a 40-hour week.

The evaluation plot shows almost the same split for leavers: the absolutely miserable got pretty good evaluations, and the mildly dissatisfied got middling ones.

left      Mean      Median
Stayed    0.667365  0.69
Left      0.440271  0.41

Those that left were roughly 22 percentage points (mean) / 28 points (median) less satisfied than those that stayed. Note the slight left skew among those that stayed (median higher than mean).

People especially quit at the 4- and 5-year marks. Attrition drops at year 6, and nobody with 7 or more years of tenure left. There's a group that just flees this company.

Tenure Left Count Percent
0 2 Stayed 2879 98.934708
1 2 Left 31 1.065292
2 3 Stayed 4316 83.159923
3 3 Left 874 16.840077
4 4 Stayed 1510 75.311721
5 4 Left 495 24.688279
6 5 Stayed 580 54.613936
7 5 Left 482 45.386064
8 6 Stayed 433 79.889299
9 6 Left 109 20.110701
10 7 Stayed 94 100.000000
11 7 Left 0 0.000000
12 8 Stayed 81 100.000000
13 8 Left 0 0.000000
14 10 Stayed 107 100.000000
15 10 Left 0 0.000000

Weird little clump of four-year employees that are miserable.

Item department Left Count Percent
0 IT Left 158.0 16.19
1 IT Stayed 818.0 83.81
2 RandD Left 85.0 12.25
3 RandD Stayed 609.0 87.75
4 accounting Left 109.0 17.55
5 accounting Stayed 512.0 82.45
6 hr Left 113.0 18.80
7 hr Stayed 488.0 81.20
8 management Left 52.0 11.93
9 management Stayed 384.0 88.07
10 marketing Left 112.0 16.64
11 marketing Stayed 561.0 83.36
12 product_mng Left 110.0 16.03
13 product_mng Stayed 576.0 83.97
14 sales Left 550.0 16.98
15 sales Stayed 2689.0 83.02
16 support Left 312.0 17.13
17 support Stayed 1509.0 82.87
18 technical Left 390.0 17.38
19 technical Stayed 1854.0 82.62

It's roughly proportional to the overall stay/leave split (83%/17%). Department doesn't appear to have a big impact. More granular details might help (e.g., subgroups of departments with bad managers may have higher attrition rates), but nothing currently jumps out.

Item salary Left Count Percent
0 high Left 48.0 4.85
1 high Stayed 942.0 95.15
2 low Left 1174.0 20.45
3 low Stayed 4566.0 79.55
4 medium Left 769.0 14.62
5 medium Stayed 4492.0 85.38

I'm not really seeing anything with salary either, beyond the expected pattern: low-paid employees leave most often and high-paid employees rarely do (note the 'high' salary group is an order of magnitude smaller than 'low' and 'medium').

Look at that group of overworked employees not getting promoted. All of the longest-working employees left.

Item work_accident left Count Percent
0 No Left 1886.0 18.60
1 No Stayed 8255.0 81.40
2 Yes Left 105.0 5.68
3 Yes Stayed 1745.0 94.32

Seems like a fluke, but it's funny that having a work accident is correlated with being less likely to leave. Otherwise, nothing notable.

Item number_project left Count Percent
0 2 Left 857.0 54.17
1 2 Stayed 725.0 45.83
2 3 Left 38.0 1.08
3 3 Stayed 3482.0 98.92
4 4 Left 237.0 6.43
5 4 Stayed 3448.0 93.57
6 5 Left 343.0 15.36
7 5 Stayed 1890.0 84.64
8 6 Left 371.0 44.92
9 6 Stayed 455.0 55.08
10 7 Left 145.0 100.00
11 7 Stayed 0.0 0.00

Yeah, number of projects is a predictor. Might as well be a giant neon sign blinking here. Ha! 100% of people with 7 projects left.

No outliers for those who stayed. Mostly a function of small sample size? Those who left appear to have been either overworked or underworked. Who has 7 projects and works only 120 hours a month? Weird.

Look at that clump of miserable people with many projects.


No strong multicollinearity. Leaving is negatively correlated with satisfaction. Monthly hours, evaluations, and number of projects are somewhat positively correlated.

Insights¶

The data suggests significant issues with employee retention at this company. Two main groups of leavers emerge:

  • Underworked and Dissatisfied: Some employees worked on fewer projects and logged fewer hours than a standard work week, with below-average satisfaction. These individuals may have been disengaged, assigned less work as they prepared to leave, or possibly let go.
  • Overworked and Burned Out: Another group managed a high number of projects (up to 7) and worked exceptionally long hours (sometimes approaching 80-hour weeks). This group reported very low satisfaction and received few, if any, promotions.

Most employees work well above a typical 40-hour work week (roughly 160–184 hours over a 20–23-workday month), indicating a culture of overwork. The lack of promotions and high workload likely contribute to dissatisfaction and attrition.

Employee evaluation scores show only a weak relationship with attrition; both leavers and stayers have similar performance reviews. High-performing employees are not necessarily retained, especially if they are overworked or dissatisfied.

Other variables—such as department, salary, and work accidents—do not show strong predictive value for employee churn compared to satisfaction and workload.

Overall, the data points to management and workload issues as primary drivers of employee turnover.

paCe: Construct Stage¶

Back to top

  • Determine which models are most appropriate
  • Construct the model
  • Confirm model assumptions
  • Evaluate model results to determine how well your model fits the data

Overview of Models Used¶

Mostly a cheat sheet for myself for future reference

I'm learning, so I'm going to build and tune several types, but I suspect the random forest or gradient boosted model will perform best. First, a review of model options, and pros / cons of each.

Logistic regression: Interpretable, fast, good for simple relationships, but limited to linear patterns. Good for baselines, explainable models.

  • Needs: Scaled numeric features; imputed missing values.
  • Watch for: Multicollinearity, non-linear patterns it can't capture, outliers.
  • Good with: Clean, linearly separable data.
  • Ease of programming: sklearn.linear_model.LogisticRegression — one-liner with few params.
  • Key risk: Misleading results if the relationship is non-linear or assumptions (e.g. independence, no multicollinearity) are violated.
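A minimal sketch of that one-liner plus the scaling it needs, on synthetic data (everything here is illustrative; the real notebook uses the HR features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Keeping the scaler inside the pipeline means it is fit on training
# folds only during cross-validation, which avoids data leakage.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
logreg = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
logreg.fit(X, y)
print(round(logreg.score(X, y), 3))
```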

Decision trees: Transparent, handles non-linear data well, but overfits easily. Good for quick models, rule-based logic.

  • Needs: Clean, complete data; no need for scaling.
  • Watch for: Overfitting from deep trees or noisy features, outliers.
  • Good with: Categorical and mixed-type features; interpretable rules.
  • Ease of programming: sklearn.tree.DecisionTreeClassifier — fast setup, good for teaching.
  • Key risk: Overfitting, especially with deep trees or small sample sizes.

Random forests: Robust, reduces overfitting, high accuracy, but less interpretable, slower. Good for strong general-purpose performance.

  • Needs: Complete-ish data (some robustness); no scaling.
  • Watch for: Bias from dominant features; slower with high-dimensional data. Less sensitive to outliers than single trees.
  • Good with: Large feature sets, avoiding overfitting, feature importance.
  • Ease of programming: sklearn.ensemble.RandomForestClassifier — easy but slower to train.
  • Key risk: Slower on very large datasets; can be harder to interpret.

Gradient boosting: Best accuracy, learns from errors iteratively, but complex, needs tuning, less interpretable. Good for optimizing structured data problems.

  • Needs: Clean data; impute missing values (or use LightGBM); no scaling.
  • Watch for: Noisy (incorrect, inconsistent) labels, overlapping classes, overfitting if untuned.
  • Good with: Tabular data with complex interactions and nonlinearity.
  • Ease of programming: xgboost.XGBClassifier — requires parameter tuning but manageable.
  • Key risk: Overfitting if not properly regularized or if too many boosting rounds are used.

Logistic regression and decision trees are easiest to interpret. Gradient boosters usually predict the best.

The Plan¶

Meticulous planning and sequencing, out of fear of data leakage¶

Data cleaning & EDA (done)

Encode categorical variables

  • Scale numeric variables for logistic regression only
  • Remove outliers for logistic regression (maybe XGBoost)

Create a df to save results

Create the target y (left) and features X

Train/test split (hold the test set until the very end)

  • stratify target, test size 20%

During initial explorations:

  • Use RandomizedSearchCV for RandomForest and XGBoost
  • GridSearchCV for Logistic Regression and Decision Trees

Baseline models

  • LogReg, Tree, RF, XGBoost
  • default hyperparameters, cross_val_score
  • compare accuracy, precision, recall, F1, ROC, AUC, confusion matrix
  • save metrics
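A hedged sketch of this baseline loop on synthetic data, with scikit-learn's GradientBoostingClassifier standing in for XGBoost so the example stays self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HR data, with a similar ~83/17 class split.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.83, 0.17], random_state=42)
# Hold out the test set up front; stratify to preserve the imbalance.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

models = {
    "LogReg": Pipeline([("scale", StandardScaler()),
                        ("model", LogisticRegression(max_iter=1000))]),
    "Tree": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    "GB": GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring="recall")
    print(f"{name}: mean recall = {scores.mean():.3f}")
```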

Review feature importance and EDA to inform model refinements. Check the misclassified cases for where the model failed (in this case, due to "gray areas": groups of employees neither obviously safe nor obviously at risk of quitting)

Feature Engineering Ideas

  • Binning: number of projects, monthly hours, satisfaction level, tenure
  • Interactions: satisfaction × projects, satisfaction × monthly hours, evaluation × satisfaction, salary × satisfaction, monthly hours / projects (could then bin hours / projects)
  • Categorical flags: no promotion 4+ years, burnout class (projects >= 6 (maybe 5), hours >= 240, satisfaction <= 0.3), disengaged class (projects <= 2, hours < 160, satisfaction <= 0.5)
  • Feature selection: drop a few features, especially for improving logistic regression

Run the feature-engineered models, save metrics

Compare models and choose best for each type

  • refit on full X_train
  • use early stopping for xgboost when fitting (GridSearchCV + Pipeline do not support early_stopping_rounds)

Final test set evaluation (one per model)

Select a winner

  • Table of results (recall, precision, F1, accuracy, ROC AUC)
  • Feature importance plots
  • Decision tree visualization
  • ROC curve (PR-ROC) plots
  • Misclassification analysis
  • Model interpretation
  • Key findings (which model is best & why)
  • Actionable business recommendations
  • Limitations and ethical concerns
  • Relevant appendices

🔎

Recall model assumptions¶

Logistic Regression model assumptions

  • Outcome variable is categorical
  • Observations are independent of each other
  • No severe multicollinearity among X variables
  • No extreme outliers
  • Linear relationship between each X variable and the logit of the outcome variable
  • Sufficiently large sample size
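The multicollinearity assumption can be spot-checked with variance inflation factors; a hand-rolled sketch on synthetic data (the `vif` helper below is mine, not part of scikit-learn):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(df: pd.DataFrame) -> pd.Series:
    """Variance inflation factor per column: 1 / (1 - R^2) when each
    feature is regressed on all the others. Values above ~5-10 flag
    problematic multicollinearity."""
    out = {}
    for col in df.columns:
        X = df.drop(columns=col).values
        y = df[col].values
        r2 = LinearRegression().fit(X, y).score(X, y)
        out[col] = 1.0 / max(1.0 - r2, 1e-12)
    return pd.Series(out)

# Synthetic demo: "c" is nearly a multiple of "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({"a": a,
                     "b": rng.normal(size=200),
                     "c": a * 2 + rng.normal(scale=0.1, size=200)})
print(vif(demo).round(1))
```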

Decision Trees, Random Forests, and Gradient Boosting assumptions

Added for comparison to LogReg

  • Do not need linearity
  • Do not need scaling
  • Tolerate multicollinearity
  • Affected by outliers (exception: random forests only mildly)
  • Need large sample size (exception: decision trees will work with limited data, but less accuracy)

💭

Reflect on these questions as you complete the constructing stage.¶

  • Do you notice anything odd?
  • Which independent variables did you choose for the model and why?
  • Are each of the assumptions met?
  • How well does your model fit the data?
  • Can you improve it? Is there anything you would change about the model?
  • What resources do you find yourself using as you complete this stage? (Make sure to include the links.)
  • Do you have any ethical considerations in this stage?

Observations from Baseline Model Building:

  • Logistic Regression performed much worse than tree-based models (recall: 0.24 vs. >0.90 for others). This suggests the relationship between features and attrition is highly non-linear, or that important interactions are not captured by a linear model.
  • Tree-based models (Decision Tree, Random Forest, XGBoost) all performed very well (recall >0.90, AUC >0.97), with XGBoost slightly ahead. Surprisingly strong for a shallow Decision Tree (max depth 4). This may indicate the data is either easy to separate or possibly a bit too “clean” (the dataset is synthetic).
  • Confusion matrices show very few false negatives for tree-based models, but Logistic Regression misses many true leavers.

Independent Variables Chosen:

  • All available features were included: satisfaction_level, last_evaluation, number_project, average_monthly_hours, tenure, work_accident, promotion_last_5years, salary (ordinal), and department (one-hot encoded).
  • This approach ensures the model can capture all possible relationships, especially since EDA showed satisfaction, workload, and tenure are strong predictors.

Model Assumptions Met:

  • Logistic Regression: Outliers were removed and features were scaled. Outcome is categorical and observations are independent (dropped duplicates). Sample size is ample. Multicollinearity was checked in heatmap at end of EDA. The poor performance suggests the linearity assumption is not met.
  • Tree-based models: No strong assumptions about feature scaling, linearity, or multicollinearity; these models are robust to the data structure provided.

Model Fit:

  • Tree-based models fit the data extremely well (recall, precision, and AUC all very high). This suggests strong predictive power, but also raises the possibility of overfitting.
  • Logistic Regression fits poorly, missing most true positives.

Potential Improvements:

  • Feature engineering: (Will do.) Create interaction terms or non-linear transformations (e.g., satisfaction × workload, tenure bins) to help linear models like Logistic Regression capture more complex relationships. Consider feature selection to remove redundant or less informative variables.
  • Interpretability: (Will do.) Use feature importance plots for tree-based models and SHAP values to explain individual predictions and overall model behavior. This will help stakeholders understand which factors drive attrition risk.
  • Model validation: (Done.) Rigorously check for data leakage by reviewing the entire data pipeline, ensuring all preprocessing steps are performed only on training data within cross-validation folds.
  • Class imbalance: (Might do.) Although recall is high, further address class imbalance by experimenting with resampling techniques (e.g., SMOTE, undersampling) or adjusting class weights, especially if the business wants to minimize false negatives.
  • Alternative Models: (Won't do anytime soon.) Try other algorithms (e.g., LightGBM, SVM, or neural networks) or ensemble approaches to see if performance or interpretability can be improved.
  • Time series data (Don't have it.) If this was real-world data, it would be nice to track changes over time in work satisfaction, performance reviews, workload, promotions, absences, etc.

Resources Used:

  • scikit-learn documentation
  • XGBoost documentation
  • pandas documentation
  • seaborn documentation
  • Kaggle HR Analytics Dataset

Ethical Considerations:

  • Ensure predictions are used to support employees (e.g., for retention efforts), not for punitive actions.
  • Ensure the model does not unfairly target or disadvantage specific groups (e.g., by department, salary, or tenure).
  • Clearly communicate how predictions are made and how they will be used by HR.
  • Protect employee data and avoid using sensitive or personally identifiable information.
  • Regularly audit the model for bias and unintended consequences after deployment.

Model Building¶

Back to top

  • Fit a model that predicts the outcome variable using two or more independent variables
  • Check model assumptions
  • Evaluate the model

Identify the type of prediction task.¶

Binary classification

Identify the types of models most appropriate for this task.¶

It's a binary classification task. I'm building a logistic regression model and tree-based models (decision tree, random forest, gradient boosting).

Modeling¶

Model prep¶

Choose evaluation metric¶

While ROC AUC is a common metric for evaluating binary classifiers—offering a threshold-independent measure of how well the model distinguishes between classes—it is not ideal for imbalanced problems like employee churn, where the positive class (those likely to leave) is much smaller and more critical to identify.

During model development, I did review ROC AUC to get a general sense of model discrimination. However, for model selection and tuning, I ultimately prioritized recall. A high recall ensures that we identify as many at-risk employees as possible, aligning with the company's goal to support retention through early intervention. Missing a potential churner (a false negative) is generally more costly than mistakenly flagging someone who is not at risk (a false positive), especially when interventions are supportive rather than punitive.

While precision is also important—since too many false positives could dilute resources or create unnecessary concern—recall is more aligned with a proactive retention strategy. This tradeoff assumes that HR interventions are constructive and that the company has systems in place to act ethically on model outputs.

To avoid unintended harm, I recommend implementing clear usage guidelines and transparency measures, ensuring that predictions are used to help employees, not penalize them. Calibration and regular fairness audits should accompany any deployment of the model.
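Prioritizing recall during tuning amounts to a one-line change in the search; a hedged sketch on synthetic, imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data (~17% positives), mirroring the churn split.
X, y = make_classification(n_samples=600, weights=[0.83, 0.17], random_state=1)
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(
    pipe,
    param_grid={"model__C": [0.01, 0.1, 1.0],
                "model__class_weight": [None, "balanced"]},
    scoring="recall",  # select hyperparameters by recall, not accuracy
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```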

Encode categorical variables¶

Original salary values:
 salary
low       5740
medium    5261
high       990
Name: count, dtype: int64

Encoded salary values:
 salary
0    5740
1    5261
2     990
Name: count, dtype: int64
Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_monthly_hours', 'tenure', 'work_accident', 'left',
       'promotion_last_5years', 'salary', 'department_IT', 'department_RandD',
       'department_accounting', 'department_hr', 'department_management',
       'department_marketing', 'department_product_mng', 'department_sales',
       'department_support', 'department_technical'],
      dtype='object')
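A toy sketch of how output like the above can be produced: ordinal-encode salary, one-hot the department (the five-row frame is made up; only the column names follow the dataset):

```python
import pandas as pd

# Five-row toy frame; only the column names follow the real dataset.
df = pd.DataFrame({
    "salary": ["low", "medium", "high", "low", "medium"],
    "department": ["sales", "IT", "hr", "sales", "technical"],
})
# salary has a natural order, so map it to 0/1/2 instead of one-hot.
df["salary"] = df["salary"].map({"low": 0, "medium": 1, "high": 2})
# department has no order, so one-hot encode it.
df = pd.get_dummies(df, columns=["department"])
print(df.columns.tolist())
```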

Split data into baseline train / test.¶

One set for tree-based models (decision tree, random forest, XGBoost), another set for logistic regression (which must have outliers removed and data normalized). Stratify the target variable each time to account for class imbalance.

I am realizing now (as I prepare to embark on feature transformation below) that some of this categorical encoding and outlier removal would more appropriately have been done in the Pipeline. But it's simple stuff. Not a dealbreaker.

Functions to make models, run models, plot confusion matrices and feature importances¶

There is a commented-out block of code in run_model_evaluation. It groups the misclassified results by leavers and stayers, and prints a summary of descriptive stats for each column. Long story short, it shows that the models have trouble with "gray area" employees, neither clearly at risk of leaving nor clearly safe. In the real world, people leave jobs for reasons unrelated to the available data: new opportunities, family issues, mere whims, etc. It's a normal limitation of predictive models in HR.

I leave it commented out, because the model printout during training is already a lot.

Baseline Models¶

Back to top

Define baseline model pipelines¶

Run the baseline models¶

Running model: Logistic Regression (base)...
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Execution time for Logistic Regression (base): 5.62 seconds
Best parameters for Logistic Regression (base): {'model__C': 0.01, 'model__class_weight': 'balanced', 'model__penalty': 'l1', 'model__solver': 'liblinear'}
Best score for Logistic Regression (base): 0.9475 (recall)
Model Logistic Regression (base) saved successfully.

Running model: Decision Tree (base)...
Fitting 5 folds for each of 72 candidates, totalling 360 fits
Execution time for Decision Tree (base): 2.71 seconds
Best parameters for Decision Tree (base): {'model__class_weight': 'balanced', 'model__max_depth': 4, 'model__min_samples_leaf': 5, 'model__min_samples_split': 2}
Best score for Decision Tree (base): 0.9422 (recall)
Model Decision Tree (base) saved successfully.

Running model: Random Forest (base)...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for Random Forest (base): 199.22 seconds
Best parameters for Random Forest (base): {'model__n_estimators': 500, 'model__min_samples_split': 4, 'model__min_samples_leaf': 1, 'model__max_samples': 1.0, 'model__max_features': 1.0, 'model__max_depth': 3, 'model__class_weight': 'balanced'}
Best score for Random Forest (base): 0.9404 (recall)
Model Random Forest (base) saved successfully.

Running model: XGBoost (base)...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for XGBoost (base): 21.78 seconds
Best parameters for XGBoost (base): {'model__subsample': 1.0, 'model__scale_pos_weight': 5.02134337727558, 'model__n_estimators': 100, 'model__min_child_weight': 1, 'model__max_depth': 3, 'model__learning_rate': 0.1, 'model__gamma': 0.1, 'model__colsample_bytree': 1.0}
Best score for XGBoost (base): 0.9366 (recall)
Model XGBoost (base) saved successfully.

Baseline results¶

Model Evaluation Results:
model recall f1 roc_auc precision accuracy features best_params cv_best_score conf_matrix search_time
0 Logistic Regression (base) 0.947508 0.642342 0.891388 0.485860 0.822232 18 {'model__C': 0.01, 'model__class_weight': 'bal... 0.947508 [[5919, 1509], [79, 1426]] 5.621748
1 Decision Tree (base) 0.942247 0.803103 0.945551 0.699767 0.923269 18 {'model__class_weight': 'balanced', 'model__ma... 0.942243 [[7355, 644], [92, 1501]] 2.705129
2 Random Forest (base) 0.940364 0.769784 0.964169 0.651588 0.906589 18 {'model__n_estimators': 500, 'model__min_sampl... 0.940354 [[7198, 801], [95, 1498]] 199.222148
3 XGBoost (base) 0.936598 0.910589 0.986294 0.885986 0.969454 18 {'model__subsample': 1.0, 'model__scale_pos_we... 0.936592 [[7807, 192], [101, 1492]] 21.777312

Observations on Baseline Results¶

  • XGBoost had the best overall performance — top precision, F1, accuracy, and ROC AUC — with recall within about a point of the other models.

  • Random Forest was close behind but took the longest to run.

  • Decision Tree was fast and reasonably strong—good for quick baselines or interpretation.

  • Logistic Regression severely underperformed on precision (note the 1,509 false positives in its confusion matrix), making it the weakest overall despite competitive recall.

Summary of Observations from the Confusion Matrices:

  • Tree-based models (Decision Tree, Random Forest, XGBoost) show very high recall, correctly identifying most employees who left (true positives), with very few false positives. They also have relatively few false negatives, indicating strong overall performance.
  • Logistic Regression produces a much higher number of false positives, flagging many employees who actually stayed. Its precision suffers as a result, making its at-risk alerts less trustworthy even though it misses few true leavers.
  • Overall, the ensemble models (Random Forest and XGBoost) provide the best balance between correctly identifying leavers and minimizing incorrect predictions, while Logistic Regression struggles with this non-linear problem.

Check feature importance¶

After fitting baseline models, I reviewed the decision tree and feature importances. This step is not to guide feature selection yet, but rather to cross-check with the EDA and ensure the models are learning meaningful patterns.

I’m mindful not to overinterpret these plots—they can be intuitive and visually appealing, but heavy reliance risks overfitting and misleading conclusions. This is a calibration check, not a signal to optimize prematurely.

All models consistently identify low satisfaction and extreme workload (either very high or very low) as the most important predictors of employee attrition. This finding aligns with the exploratory data analysis (EDA). Tenure also emerges as a significant factor, matching a pattern around the 4-5 year mark observed in the EDA. In contrast, salary, department, and recent promotions have minimal predictive value in this dataset. These key features are especially prominent in the ensemble models (Random Forest and XGBoost), which are likely the most robust. While all models highlight these variables, it is important to note that decision trees are prone to overfitting, and logistic regression underperforms due to its inability to capture non-linear relationships present in the data.

Feature Engineering (Round One)¶

Back to top

Based on EDA and feature importance, focus on:

  • Satisfaction level (especially low values)
  • Extreme workload (very high or very low monthly hours, number of projects)
  • Tenure (especially the 4–5 year window)

Feature engineering steps to experiment with:

Binning:

  • Bin satisfaction_level (e.g., low/medium/high)
  • Bin average_monthly_hours (e.g., <160, 160–240, >240)
  • Bin number_project (e.g., ≤2, 3–5, ≥6)
  • Bin tenure (e.g., ≤3, 4–5, >5 years)
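Binning like this is a one-liner with `pd.cut`; a sketch using the monthly-hours thresholds above (edge handling is approximate):

```python
import pandas as pd

hours = pd.Series([120, 150, 180, 250, 310])
# Thresholds from the plan above (<160, 160-240, >240); edges are
# right-inclusive here, which is close enough for a sketch.
hours_bin = pd.cut(hours, bins=[0, 160, 240, float("inf")],
                   labels=["under", "normal", "over"])
print(hours_bin.tolist())
```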

Interactions:

  • satisfaction_level * number_project
    • low: possibly disengaged or underperforming
    • high: possibly engaged top performer or healthy productivity
    • mid: potential burnout
  • satisfaction_level * average_monthly_hours
    • satisfaction given workload
    • low: burnout risk
    • high: engaged
  • evaluation * satisfaction
    • performance and morale
    • both low: possibly disengaged firing risk
    • both high: ideal employee
    • high eval, low satisfaction: attrition risk
  • monthly_hours / number_project
    • overwork / underwork index

Categorical Flags:

  • burnout: (projects ≥ 6 or hours ≥ 240) & satisfaction ≤ 0.3
  • disengaged: (projects ≤ 2 and hours < 160 and satisfaction ≤ 0.5)
  • no_promo_4yr: (promotion_last_5years == 0) & (tenure >= 4)
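These flags translate directly into boolean masks; a toy sketch (the three-row frame is invented, the thresholds are the ones listed above):

```python
import pandas as pd

df = pd.DataFrame({
    "number_project": [7, 2, 4],
    "average_monthly_hours": [300, 130, 200],
    "satisfaction_level": [0.1, 0.4, 0.7],
    "promotion_last_5years": [0, 0, 1],
    "tenure": [5, 2, 6],
})
# Flags mirror the thresholds listed above.
df["burnout"] = (((df["number_project"] >= 6)
                  | (df["average_monthly_hours"] >= 240))
                 & (df["satisfaction_level"] <= 0.3)).astype(int)
df["disengaged"] = ((df["number_project"] <= 2)
                    & (df["average_monthly_hours"] < 160)
                    & (df["satisfaction_level"] <= 0.5)).astype(int)
df["no_promo_4yr"] = ((df["promotion_last_5years"] == 0)
                      & (df["tenure"] >= 4)).astype(int)
print(df[["burnout", "disengaged", "no_promo_4yr"]].values.tolist())
```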

Feature Selection:

  • Drop weak predictors (e.g., department, salary, work_accident) for logistic regression, as they add noise and multicollinearity.

Note:
For quick testing of new features and combinations, I started with small hyperparameter grids. I also tried very wide grids and let long searches run unattended, but eventually settled on a strategy of exhaustively grid searching the fast models and randomly searching the heavy tree models. Once the best feature set was identified, I ran a final round of training with a more extensive hyperparameter grid for optimal performance.
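
That two-tier search strategy can be sketched with scikit-learn: exhaustive `GridSearchCV` for the cheap logistic model, `RandomizedSearchCV` for the heavier forest. The synthetic data and the particular grids below are illustrative, not the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced toy data standing in for the HR dataset.
X, y = make_classification(n_samples=500, weights=[0.8], random_state=42)

# Fast model: small grid, searched exhaustively.
lr_pipe = Pipeline([("scaler", StandardScaler()),
                    ("model", LogisticRegression(max_iter=1000))])
lr_grid = {"model__C": [0.01, 0.1, 1], "model__class_weight": ["balanced"]}
lr_search = GridSearchCV(lr_pipe, lr_grid, scoring="recall", cv=5).fit(X, y)

# Heavy model: wider space, random sample of candidates.
rf_pipe = Pipeline([("model", RandomForestClassifier(random_state=42))])
rf_space = {"model__n_estimators": [100, 300],
            "model__max_depth": [3, 5, 8],
            "model__min_samples_leaf": [1, 2, 3],
            "model__class_weight": ["balanced"]}
rf_search = RandomizedSearchCV(rf_pipe, rf_space, n_iter=5, scoring="recall",
                               cv=5, random_state=42).fit(X, y)
```

`RandomizedSearchCV` caps the number of fits at `n_iter` regardless of how wide the space grows, which is what makes it practical for the slow tree ensembles.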

Feature engineering functions¶

Feature selection¶

Define logistic regression models¶

Run logistic regression models¶

Running model: Logistic Regression with Binning...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Execution time for Logistic Regression with Binning: 2.60 seconds
Best parameters for Logistic Regression with Binning: {'model__C': 0.1, 'model__class_weight': 'balanced', 'model__penalty': 'l2', 'model__solver': 'liblinear'}
Best score for Logistic Regression with Binning: 0.9375 (recall)
Model Logistic Regression with Binning not saved. Set save_model=True to save it.

Running model: Logistic Regression with Interaction...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Execution time for Logistic Regression with Interaction: 4.22 seconds
Best parameters for Logistic Regression with Interaction: {'model__C': 0.1, 'model__class_weight': 'balanced', 'model__penalty': 'l1', 'model__solver': 'liblinear'}
Best score for Logistic Regression with Interaction: 0.9336 (recall)
Model Logistic Regression with Interaction not saved. Set save_model=True to save it.

Running model: Logistic Regression with Flags...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Execution time for Logistic Regression with Flags: 1.62 seconds
Best parameters for Logistic Regression with Flags: {'model__C': 0.1, 'model__class_weight': 'balanced', 'model__penalty': 'l1', 'model__solver': 'liblinear'}
Best score for Logistic Regression with Flags: 0.9176 (recall)
Model Logistic Regression with Flags not saved. Set save_model=True to save it.

Running model: Logistic Regression with Binning (feature selection)...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Execution time for Logistic Regression with Binning (feature selection): 1.49 seconds
Best parameters for Logistic Regression with Binning (feature selection): {'model__C': 0.1, 'model__class_weight': 'balanced', 'model__penalty': 'l1', 'model__solver': 'liblinear'}
Best score for Logistic Regression with Binning (feature selection): 0.9395 (recall)
Model Logistic Regression with Binning (feature selection) not saved. Set save_model=True to save it.

Running model: Logistic Regression with Interaction (feature selection)...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Execution time for Logistic Regression with Interaction (feature selection): 2.05 seconds
Best parameters for Logistic Regression with Interaction (feature selection): {'model__C': 0.1, 'model__class_weight': 'balanced', 'model__penalty': 'l2', 'model__solver': 'liblinear'}
Best score for Logistic Regression with Interaction (feature selection): 0.9601 (recall)
Model Logistic Regression with Interaction (feature selection) not saved. Set save_model=True to save it.

Running model: Logistic Regression with Flags (feature selection)...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Execution time for Logistic Regression with Flags (feature selection): 0.71 seconds
Best parameters for Logistic Regression with Flags (feature selection): {'model__C': 0.1, 'model__class_weight': 'balanced', 'model__penalty': 'l1', 'model__solver': 'liblinear'}
Best score for Logistic Regression with Flags (feature selection): 0.9176 (recall)
Model Logistic Regression with Flags (feature selection) not saved. Set save_model=True to save it.

Feature Engineered Model Evaluation Results:
model recall f1 roc_auc precision accuracy features best_params cv_best_score conf_matrix search_time
4 Logistic Regression with Interaction (feature ... 0.960133 0.671312 0.891666 0.516071 0.841599 10 {'model__C': 0.1, 'model__class_weight': 'bala... 0.960133 [[6073, 1355], [60, 1445]] 2.050218
3 Logistic Regression with Binning (feature sele... 0.939535 0.753531 0.947481 0.629004 0.896451 14 {'model__C': 0.1, 'model__class_weight': 'bala... 0.939535 [[6594, 834], [91, 1414]] 1.489630
0 Logistic Regression with Binning 0.937542 0.755151 0.952159 0.632168 0.897571 26 {'model__C': 0.1, 'model__class_weight': 'bala... 0.937542 [[6607, 821], [94, 1411]] 2.604423
1 Logistic Regression with Interaction 0.933555 0.672571 0.901327 0.525627 0.846860 22 {'model__C': 0.1, 'model__class_weight': 'bala... 0.933555 [[6160, 1268], [100, 1405]] 4.217048
2 Logistic Regression with Flags 0.917608 0.803608 0.957952 0.714803 0.924437 21 {'model__C': 0.1, 'model__class_weight': 'bala... 0.917608 [[6877, 551], [124, 1381]] 1.620208

Observations of Feature-Engineered Logistic Regression Results¶

  • Logistic Regression with Interaction (feature selection) had the highest recall (0.960) while using only 10 features, and the flag-based models delivered the best precision and F1 among the logistic variants—both highly interpretable and efficient.

  • Feature selection (removing department, salary, work_accident, etc.) simplified the models without hurting accuracy.

  • Interaction features improved recall over the baseline, while binning and flag features improved F1 and precision, with the flag-based models showing the biggest gains.

  • Interpretability: These models are transparent and easy to explain—ideal for HR use.

  • Summary: With targeted feature engineering, logistic regression can approach the accuracy of complex models while staying simple and explainable.

Define tree-based feature engineering models¶

Run tree-based feature engineering models¶

Running model: Decision Tree with Binning...
Fitting 5 folds for each of 90 candidates, totalling 450 fits
Execution time for Decision Tree with Binning: 5.99 seconds
Best parameters for Decision Tree with Binning: {'model__class_weight': 'balanced', 'model__max_depth': 8, 'model__min_samples_leaf': 2, 'model__min_samples_split': 2}
Best score for Decision Tree with Binning: 0.9341 (recall)
Model Decision Tree with Binning not saved. Set save_model=True to save it.

Running model: Decision Tree with Interaction...
Fitting 5 folds for each of 90 candidates, totalling 450 fits
Execution time for Decision Tree with Interaction: 6.92 seconds
Best parameters for Decision Tree with Interaction: {'model__class_weight': 'balanced', 'model__max_depth': 6, 'model__min_samples_leaf': 3, 'model__min_samples_split': 2}
Best score for Decision Tree with Interaction: 0.9334 (recall)
Model Decision Tree with Interaction not saved. Set save_model=True to save it.

Running model: Decision Tree with Flags...
Fitting 5 folds for each of 90 candidates, totalling 450 fits
Execution time for Decision Tree with Flags: 4.09 seconds
Best parameters for Decision Tree with Flags: {'model__class_weight': 'balanced', 'model__max_depth': 6, 'model__min_samples_leaf': 1, 'model__min_samples_split': 2}
Best score for Decision Tree with Flags: 0.9309 (recall)
Model Decision Tree with Flags not saved. Set save_model=True to save it.

Running model: Decision Tree with Binning (feature selection)...
Fitting 5 folds for each of 90 candidates, totalling 450 fits
Execution time for Decision Tree with Binning (feature selection): 5.85 seconds
Best parameters for Decision Tree with Binning (feature selection): {'model__class_weight': 'balanced', 'model__max_depth': 8, 'model__min_samples_leaf': 2, 'model__min_samples_split': 2}
Best score for Decision Tree with Binning (feature selection): 0.9353 (recall)
Model Decision Tree with Binning (feature selection) not saved. Set save_model=True to save it.

Running model: Decision Tree with Interaction (feature selection)...
Fitting 5 folds for each of 90 candidates, totalling 450 fits
Execution time for Decision Tree with Interaction (feature selection): 7.16 seconds
Best parameters for Decision Tree with Interaction (feature selection): {'model__class_weight': 'balanced', 'model__max_depth': 5, 'model__min_samples_leaf': 3, 'model__min_samples_split': 2}
Best score for Decision Tree with Interaction (feature selection): 0.9328 (recall)
Model Decision Tree with Interaction (feature selection) not saved. Set save_model=True to save it.

Running model: Decision Tree with Flags (feature selection)...
Fitting 5 folds for each of 90 candidates, totalling 450 fits
Execution time for Decision Tree with Flags (feature selection): 3.76 seconds
Best parameters for Decision Tree with Flags (feature selection): {'model__class_weight': 'balanced', 'model__max_depth': 6, 'model__min_samples_leaf': 1, 'model__min_samples_split': 2}
Best score for Decision Tree with Flags (feature selection): 0.9297 (recall)
Model Decision Tree with Flags (feature selection) not saved. Set save_model=True to save it.

Running model: Random Forest with Binning...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for Random Forest with Binning: 104.34 seconds
Best parameters for Random Forest with Binning: {'model__n_estimators': 300, 'model__min_samples_split': 2, 'model__min_samples_leaf': 1, 'model__max_samples': 1.0, 'model__max_features': 1.0, 'model__max_depth': 5, 'model__class_weight': 'balanced'}
Best score for Random Forest with Binning: 0.9278 (recall)
Model Random Forest with Binning not saved. Set save_model=True to save it.

Running model: Random Forest with Interaction...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for Random Forest with Interaction: 197.87 seconds
Best parameters for Random Forest with Interaction: {'model__n_estimators': 300, 'model__min_samples_split': 2, 'model__min_samples_leaf': 1, 'model__max_samples': 1.0, 'model__max_features': 1.0, 'model__max_depth': 5, 'model__class_weight': 'balanced'}
Best score for Random Forest with Interaction: 0.9297 (recall)
Model Random Forest with Interaction not saved. Set save_model=True to save it.

Running model: Random Forest with Flags...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for Random Forest with Flags: 82.22 seconds
Best parameters for Random Forest with Flags: {'model__n_estimators': 100, 'model__min_samples_split': 2, 'model__min_samples_leaf': 2, 'model__max_samples': 0.7, 'model__max_features': 1.0, 'model__max_depth': 3, 'model__class_weight': 'balanced'}
Best score for Random Forest with Flags: 0.9290 (recall)
Model Random Forest with Flags not saved. Set save_model=True to save it.

Running model: XGBoost with Binning...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for XGBoost with Binning: 26.62 seconds
Best parameters for XGBoost with Binning: {'model__subsample': 1.0, 'model__scale_pos_weight': 5.02134337727558, 'model__reg_lambda': 2, 'model__reg_alpha': 1, 'model__n_estimators': 100, 'model__min_child_weight': 5, 'model__max_depth': 3, 'model__learning_rate': 0.2, 'model__gamma': 0.1, 'model__colsample_bytree': 0.6}
Best score for XGBoost with Binning: 0.9366 (recall)
Model XGBoost with Binning not saved. Set save_model=True to save it.

Running model: XGBoost with Interaction...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for XGBoost with Interaction: 28.74 seconds
Best parameters for XGBoost with Interaction: {'model__subsample': 0.8, 'model__scale_pos_weight': 5.02134337727558, 'model__reg_lambda': 5, 'model__reg_alpha': 1, 'model__n_estimators': 100, 'model__min_child_weight': 1, 'model__max_depth': 3, 'model__learning_rate': 0.2, 'model__gamma': 0, 'model__colsample_bytree': 1.0}
Best score for XGBoost with Interaction: 0.9341 (recall)
Model XGBoost with Interaction not saved. Set save_model=True to save it.

Running model: XGBoost with Flags...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for XGBoost with Flags: 21.70 seconds
Best parameters for XGBoost with Flags: {'model__subsample': 1.0, 'model__scale_pos_weight': 5.02134337727558, 'model__reg_lambda': 2, 'model__reg_alpha': 1, 'model__n_estimators': 100, 'model__min_child_weight': 5, 'model__max_depth': 3, 'model__learning_rate': 0.2, 'model__gamma': 0.1, 'model__colsample_bytree': 0.6}
Best score for XGBoost with Flags: 0.9347 (recall)
Model XGBoost with Flags not saved. Set save_model=True to save it.

Running model: Random Forest with Binning (feature selection)...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for Random Forest with Binning (feature selection): 78.62 seconds
Best parameters for Random Forest with Binning (feature selection): {'model__n_estimators': 300, 'model__min_samples_split': 2, 'model__min_samples_leaf': 1, 'model__max_samples': 1.0, 'model__max_features': 1.0, 'model__max_depth': 5, 'model__class_weight': 'balanced'}
Best score for Random Forest with Binning (feature selection): 0.9284 (recall)
Model Random Forest with Binning (feature selection) not saved. Set save_model=True to save it.

Running model: Random Forest with Interaction (feature selection)...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for Random Forest with Interaction (feature selection): 172.35 seconds
Best parameters for Random Forest with Interaction (feature selection): {'model__n_estimators': 300, 'model__min_samples_split': 2, 'model__min_samples_leaf': 1, 'model__max_samples': 1.0, 'model__max_features': 1.0, 'model__max_depth': 5, 'model__class_weight': 'balanced'}
Best score for Random Forest with Interaction (feature selection): 0.9303 (recall)
Model Random Forest with Interaction (feature selection) not saved. Set save_model=True to save it.

Running model: Random Forest with Flags (feature selection)...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for Random Forest with Flags (feature selection): 74.48 seconds
Best parameters for Random Forest with Flags (feature selection): {'model__n_estimators': 100, 'model__min_samples_split': 2, 'model__min_samples_leaf': 2, 'model__max_samples': 0.7, 'model__max_features': 1.0, 'model__max_depth': 3, 'model__class_weight': 'balanced'}
Best score for Random Forest with Flags (feature selection): 0.9290 (recall)
Model Random Forest with Flags (feature selection) not saved. Set save_model=True to save it.

Running model: XGBoost with Binning (feature selection)...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for XGBoost with Binning (feature selection): 18.45 seconds
Best parameters for XGBoost with Binning (feature selection): {'model__subsample': 1.0, 'model__scale_pos_weight': 5.02134337727558, 'model__reg_lambda': 1, 'model__reg_alpha': 0.1, 'model__n_estimators': 100, 'model__min_child_weight': 5, 'model__max_depth': 3, 'model__learning_rate': 0.2, 'model__gamma': 0, 'model__colsample_bytree': 1.0}
Best score for XGBoost with Binning (feature selection): 0.9360 (recall)
Model XGBoost with Binning (feature selection) not saved. Set save_model=True to save it.

Running model: XGBoost with Interaction (feature selection)...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for XGBoost with Interaction (feature selection): 17.94 seconds
Best parameters for XGBoost with Interaction (feature selection): {'model__subsample': 1.0, 'model__scale_pos_weight': 5.02134337727558, 'model__reg_lambda': 1, 'model__reg_alpha': 0.1, 'model__n_estimators': 100, 'model__min_child_weight': 5, 'model__max_depth': 3, 'model__learning_rate': 0.2, 'model__gamma': 0, 'model__colsample_bytree': 1.0}
Best score for XGBoost with Interaction (feature selection): 0.9316 (recall)
Model XGBoost with Interaction (feature selection) not saved. Set save_model=True to save it.

Running model: XGBoost with Flags (feature selection)...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for XGBoost with Flags (feature selection): 15.12 seconds
Best parameters for XGBoost with Flags (feature selection): {'model__subsample': 1.0, 'model__scale_pos_weight': 5.02134337727558, 'model__reg_lambda': 5, 'model__reg_alpha': 1, 'model__n_estimators': 300, 'model__min_child_weight': 5, 'model__max_depth': 3, 'model__learning_rate': 0.1, 'model__gamma': 0, 'model__colsample_bytree': 1.0}
Best score for XGBoost with Flags (feature selection): 0.9353 (recall)
Model XGBoost with Flags (feature selection) not saved. Set save_model=True to save it.

Feature Engineered Tree-Based Model Evaluation Results:
model recall f1 roc_auc precision accuracy features best_params cv_best_score conf_matrix search_time
9 XGBoost with Binning 0.936598 0.906165 0.985069 0.877647 0.967786 26 {'model__subsample': 1.0, 'model__scale_pos_we... 0.936594 [[7791, 208], [101, 1492]] 26.615765
15 XGBoost with Binning (feature selection) 0.935970 0.911648 0.984534 0.888558 0.969871 14 {'model__subsample': 1.0, 'model__scale_pos_we... 0.935965 [[7812, 187], [102, 1491]] 18.453887
3 Decision Tree with Binning (feature selection) 0.935342 0.899758 0.959523 0.866783 0.965388 14 {'model__class_weight': 'balanced', 'model__ma... 0.935331 [[7770, 229], [103, 1490]] 5.850951
17 XGBoost with Flags (feature selection) 0.935342 0.914391 0.983656 0.894358 0.970913 9 {'model__subsample': 1.0, 'model__scale_pos_we... 0.935334 [[7823, 176], [103, 1490]] 15.118324
11 XGBoost with Flags 0.934714 0.911819 0.985902 0.890018 0.969975 21 {'model__subsample': 1.0, 'model__scale_pos_we... 0.934709 [[7815, 184], [104, 1489]] 21.695929

Patterns in Results thus far¶

  • Recall is consistently high across all models, especially for Logistic Regression and Decision Tree (base), indicating strong sensitivity to identifying leavers.
  • F1 and Precision are much lower for Logistic Regression (base), suggesting many false positives. Tree-based and XGBoost models have much better balance between recall and precision.
  • ROC AUC is highest for XGBoost and Random Forest, showing strong overall discrimination.
  • Feature selection and engineering (binning, interaction, flags) generally improves F1, precision, and accuracy, sometimes at a small cost to recall.
  • Reducing features (feature selection) often maintains or even improves performance, especially for XGBoost and Decision Tree, and greatly reduces model complexity and training time.
  • Confusion matrices show that most errors are false positives (predicting leave when they stay), which is expected with class_weight='balanced' and high recall focus.
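
The confusion matrices in the tables above tie directly to the reported metrics. As a sanity check, recomputing from the "XGBoost with Binning" matrix `[[7791, 208], [101, 1492]]` reproduces its recall, precision, accuracy, and F1:

```python
# Entries from the XGBoost with Binning confusion matrix above:
# rows = actual (stay, leave), columns = predicted (stay, leave).
tn, fp, fn, tp = 7791, 208, 101, 1492

recall = tp / (tp + fn)                      # sensitivity to leavers
precision = tp / (tp + fp)                   # share of flagged employees who leave
accuracy = (tp + tn) / (tn + fp + fn + tp)
f1 = 2 * precision * recall / (precision + recall)

print(round(recall, 4), round(precision, 4), round(f1, 4))
# → 0.9366 0.8776 0.9062, matching the table
```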

Feature Engineering (Round Two)¶

Back to top

What? Why?¶

This round is really about shrinking the feature set: a little feature engineering paired with aggressive feature selection. The feature-rich models barely improved performance, and in some cases reduced it, while feature selection performed well.

Simpler models are easier to explain to stakeholders, and a smaller feature set should also reduce noise and potential multicollinearity.

Selected features + burnout flag: This set isolates the core predictors of attrition (satisfaction, workload, tenure, promotion) and adds a “burnout” flag to capture the high-risk group of overworked, dissatisfied employees.

Selected features + interactions: This set focuses on the main drivers (satisfaction, workload, tenure) and adds interaction terms (satisfaction × projects, hours per project) to capture non-linear effects and workload intensity, which EDA showed are important for distinguishing between underworked, overworked, and healthy employees.

Selected features + interactions + burnout flag: This feature set combines the core predictors of attrition (satisfaction, workload, tenure) with a “burnout” flag to capture high-risk, overworked employees. It also includes a key interaction term, "satisfaction × projects", to distinguish between groups identified in EDA.

satisfaction_x_projects separates healthy, burned-out, and underperforming employees:

  • Employees who are satisfied and productive (high satisfaction, moderate projects)
  • Those who are overworked and dissatisfied (low satisfaction, high projects)
  • Those who are disengaged (low satisfaction, low projects)

hours per project captures nuanced patterns of overwork and underwork:

  • Employees with many projects but reasonable hours (healthy workload)
  • Employees with few projects but high hours (potentially inefficient or struggling)
  • Employees with many projects and high hours (burnout risk)
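
The three round-two feature sets can be expressed as simple column lists reusing the engineered features defined earlier; the exact `core` membership here is illustrative, not the notebook's definitive selection.

```python
# Hypothetical core predictors retained after round-one feature selection.
core = ["satisfaction_level", "average_monthly_hours", "number_project",
        "tenure", "promotion_last_5years"]

# The three candidate feature sets described above.
feature_sets = {
    "Core + Burnout": core + ["burnout"],
    "Core + Interactions": core + ["satisfaction_x_projects", "hours_per_project"],
    "Core + Interactions + Burnout": core + ["satisfaction_x_projects", "burnout"],
}
```

Each set would then be fed through the same model-running helper used in round one, so results stay directly comparable.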

Define feature engineering round 2 models¶

Run feature engineering round 2 models¶

Running model: Logistic Regression (Core + Burnout)...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Execution time for Logistic Regression (Core + Burnout): 0.57 seconds
Best parameters for Logistic Regression (Core + Burnout): {'model__C': 0.1, 'model__class_weight': 'balanced', 'model__penalty': 'l1', 'model__solver': 'liblinear'}
Best score for Logistic Regression (Core + Burnout): 0.9349 (recall)
Model Logistic Regression (Core + Burnout) not saved. Set save_model=True to save it.

Running model: Logistic Regression (Core + Interactions)...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Execution time for Logistic Regression (Core + Interactions): 0.78 seconds
Best parameters for Logistic Regression (Core + Interactions): {'model__C': 0.1, 'model__class_weight': 'balanced', 'model__penalty': 'l1', 'model__solver': 'liblinear'}
Best score for Logistic Regression (Core + Interactions): 0.9621 (recall)
Model Logistic Regression (Core + Interactions) not saved. Set save_model=True to save it.

Running model: Logistic Regression (Core + Interactions + Burnout)...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Execution time for Logistic Regression (Core + Interactions + Burnout): 0.99 seconds
Best parameters for Logistic Regression (Core + Interactions + Burnout): {'model__C': 0.1, 'model__class_weight': 'balanced', 'model__penalty': 'l2', 'model__solver': 'liblinear'}
Best score for Logistic Regression (Core + Interactions + Burnout): 0.9515 (recall)
Model Logistic Regression (Core + Interactions + Burnout) not saved. Set save_model=True to save it.

Running model: Decision Tree (Core + Burnout)...
Fitting 5 folds for each of 90 candidates, totalling 450 fits
Execution time for Decision Tree (Core + Burnout): 3.13 seconds
Best parameters for Decision Tree (Core + Burnout): {'model__class_weight': 'balanced', 'model__max_depth': 5, 'model__min_samples_leaf': 1, 'model__min_samples_split': 2}
Best score for Decision Tree (Core + Burnout): 0.9435 (recall)
Model Decision Tree (Core + Burnout) not saved. Set save_model=True to save it.

Running model: Decision Tree (Core + Interactions)...
Fitting 5 folds for each of 90 candidates, totalling 450 fits
Execution time for Decision Tree (Core + Interactions): 3.51 seconds
Best parameters for Decision Tree (Core + Interactions): {'model__class_weight': 'balanced', 'model__max_depth': 8, 'model__min_samples_leaf': 3, 'model__min_samples_split': 2}
Best score for Decision Tree (Core + Interactions): 0.9316 (recall)
Model Decision Tree (Core + Interactions) not saved. Set save_model=True to save it.

Running model: Decision Tree (Core + Interactions + Burnout)...
Fitting 5 folds for each of 90 candidates, totalling 450 fits
Execution time for Decision Tree (Core + Interactions + Burnout): 2.92 seconds
Best parameters for Decision Tree (Core + Interactions + Burnout): {'model__class_weight': 'balanced', 'model__max_depth': 5, 'model__min_samples_leaf': 1, 'model__min_samples_split': 2}
Best score for Decision Tree (Core + Interactions + Burnout): 0.9303 (recall)
Model Decision Tree (Core + Interactions + Burnout) not saved. Set save_model=True to save it.

Running model: Random Forest (Core + Burnout)...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for Random Forest (Core + Burnout): 71.53 seconds
Best parameters for Random Forest (Core + Burnout): {'model__n_estimators': 300, 'model__min_samples_split': 2, 'model__min_samples_leaf': 1, 'model__max_samples': 1.0, 'model__max_features': 1.0, 'model__max_depth': 5, 'model__class_weight': 'balanced'}
Best score for Random Forest (Core + Burnout): 0.9410 (recall)
Model Random Forest (Core + Burnout) not saved. Set save_model=True to save it.

Running model: Random Forest (Core + Interactions)...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for Random Forest (Core + Interactions): 91.35 seconds
Best parameters for Random Forest (Core + Interactions): {'model__n_estimators': 100, 'model__min_samples_split': 2, 'model__min_samples_leaf': 1, 'model__max_samples': 1.0, 'model__max_features': 1.0, 'model__max_depth': 3, 'model__class_weight': 'balanced'}
Best score for Random Forest (Core + Interactions): 0.9278 (recall)
Model Random Forest (Core + Interactions) not saved. Set save_model=True to save it.

Running model: Random Forest (Core + Interactions + Burnout)...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for Random Forest (Core + Interactions + Burnout): 73.98 seconds
Best parameters for Random Forest (Core + Interactions + Burnout): {'model__n_estimators': 100, 'model__min_samples_split': 3, 'model__min_samples_leaf': 2, 'model__max_samples': 1.0, 'model__max_features': 1.0, 'model__max_depth': 5, 'model__class_weight': 'balanced'}
Best score for Random Forest (Core + Interactions + Burnout): 0.9297 (recall)
Model Random Forest (Core + Interactions + Burnout) not saved. Set save_model=True to save it.

Running model: XGBoost (Core + Burnout)...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for XGBoost (Core + Burnout): 15.51 seconds
Best parameters for XGBoost (Core + Burnout): {'model__subsample': 1.0, 'model__scale_pos_weight': 5.02134337727558, 'model__reg_lambda': 2, 'model__reg_alpha': 1, 'model__n_estimators': 100, 'model__min_child_weight': 5, 'model__max_depth': 3, 'model__learning_rate': 0.2, 'model__gamma': 0.1, 'model__colsample_bytree': 0.6}
Best score for XGBoost (Core + Burnout): 0.9366 (recall)
Model XGBoost (Core + Burnout) not saved. Set save_model=True to save it.

Running model: XGBoost (Core + Interactions)...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for XGBoost (Core + Interactions): 15.30 seconds
Best parameters for XGBoost (Core + Interactions): {'model__subsample': 1.0, 'model__scale_pos_weight': 5.02134337727558, 'model__reg_lambda': 2, 'model__reg_alpha': 1, 'model__n_estimators': 100, 'model__min_child_weight': 5, 'model__max_depth': 3, 'model__learning_rate': 0.2, 'model__gamma': 0.1, 'model__colsample_bytree': 0.6}
Best score for XGBoost (Core + Interactions): 0.9309 (recall)
Model XGBoost (Core + Interactions) not saved. Set save_model=True to save it.

Running model: XGBoost (Core + Interactions + Burnout)...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Execution time for XGBoost (Core + Interactions + Burnout): 14.74 seconds
Best parameters for XGBoost (Core + Interactions + Burnout): {'model__subsample': 1.0, 'model__scale_pos_weight': 5.02134337727558, 'model__reg_lambda': 2, 'model__reg_alpha': 1, 'model__n_estimators': 100, 'model__min_child_weight': 5, 'model__max_depth': 3, 'model__learning_rate': 0.2, 'model__gamma': 0.1, 'model__colsample_bytree': 0.6}
Best score for XGBoost (Core + Interactions + Burnout): 0.9322 (recall)
Model XGBoost (Core + Interactions + Burnout) not saved. Set save_model=True to save it.

Feature Engineered Round 2 Model Evaluation Results:
model recall f1 roc_auc precision accuracy features best_params cv_best_score conf_matrix search_time
1 Logistic Regression (Core + Interactions) 0.962126 0.666974 0.888335 0.510398 0.838128 6 {'model__C': 0.1, 'model__class_weight': 'bala... 0.962126 [[6039, 1389], [57, 1448]] 0.781924
2 Logistic Regression (Core + Interactions + Bur... 0.951495 0.689290 0.903177 0.540377 0.855480 6 {'model__C': 0.1, 'model__class_weight': 'bala... 0.951495 [[6210, 1218], [73, 1432]] 0.986237
3 Decision Tree (Core + Burnout) 0.943503 0.813972 0.956257 0.715714 0.928378 7 {'model__class_weight': 'balanced', 'model__ma... 0.943495 [[7402, 597], [90, 1503]] 3.126584
6 Random Forest (Core + Burnout) 0.940992 0.836729 0.976981 0.753266 0.939012 7 {'model__n_estimators': 300, 'model__min_sampl... 0.940991 [[7508, 491], [94, 1499]] 71.533977
9 XGBoost (Core + Burnout) 0.936598 0.905615 0.983741 0.876616 0.967577 7 {'model__subsample': 1.0, 'model__scale_pos_we... 0.936592 [[7789, 210], [101, 1492]] 15.508418

Model Evaluation Results¶

Back to top

All Model Evaluation Results:
model recall f1 roc_auc precision accuracy features best_params cv_best_score conf_matrix search_time
28 Logistic Regression (Core + Interactions) 0.962126 0.666974 0.888335 0.510398 0.838128 6 {'model__C': 0.1, 'model__class_weight': 'bala... 0.962126 [[6039, 1389], [57, 1448]] 0.781924
4 Logistic Regression with Interaction (feature ... 0.960133 0.671312 0.891666 0.516071 0.841599 10 {'model__C': 0.1, 'model__class_weight': 'bala... 0.960133 [[6073, 1355], [60, 1445]] 2.050218
29 Logistic Regression (Core + Interactions + Bur... 0.951495 0.689290 0.903177 0.540377 0.855480 6 {'model__C': 0.1, 'model__class_weight': 'bala... 0.951495 [[6210, 1218], [73, 1432]] 0.986237
0 Logistic Regression (base) 0.947508 0.642342 0.891388 0.485860 0.822232 18 {'model__C': 0.01, 'model__class_weight': 'bal... 0.947508 [[5919, 1509], [79, 1426]] 5.621748
30 Decision Tree (Core + Burnout) 0.943503 0.813972 0.956257 0.715714 0.928378 7 {'model__class_weight': 'balanced', 'model__ma... 0.943495 [[7402, 597], [90, 1503]] 3.126584
1 Decision Tree (base) 0.942247 0.803103 0.945551 0.699767 0.923269 18 {'model__class_weight': 'balanced', 'model__ma... 0.942243 [[7355, 644], [92, 1501]] 2.705129
31 Random Forest (Core + Burnout) 0.940992 0.836729 0.976981 0.753266 0.939012 7 {'model__n_estimators': 300, 'model__min_sampl... 0.940991 [[7508, 491], [94, 1499]] 71.533977
2 Random Forest (base) 0.940364 0.769784 0.964169 0.651588 0.906589 18 {'model__n_estimators': 500, 'model__min_sampl... 0.940354 [[7198, 801], [95, 1498]] 199.222148
5 Logistic Regression with Binning (feature sele... 0.939535 0.753531 0.947481 0.629004 0.896451 14 {'model__C': 0.1, 'model__class_weight': 'bala... 0.939535 [[6594, 834], [91, 1414]] 1.489630
6 Logistic Regression with Binning 0.937542 0.755151 0.952159 0.632168 0.897571 26 {'model__C': 0.1, 'model__class_weight': 'bala... 0.937542 [[6607, 821], [94, 1411]] 2.604423

Model Evaluation Summary¶

1. Logistic Regression¶

  • Best Recall:
    • Logistic Regression (Core + Interactions) achieves the highest recall (0.962), with only 6 features and a simple, interpretable model.
    • Other logistic regression variants with feature selection or binning also maintain high recall (0.94–0.96) with fewer features.
  • F1 & Precision:
    • F1 scores for logistic regression are generally lower (0.64–0.75), reflecting lower precision (0.51–0.63).
    • Feature selection and engineering (e.g., interactions, binning) slightly improve F1 and precision while keeping models simple.

2. Tree-Based Models (Decision Tree, Random Forest, XGBoost)¶

  • Top F1 & Precision:
    • XGBoost and Random Forest models consistently achieve the highest F1 (up to 0.91) and precision (up to 0.89), with strong recall (0.93–0.94).
    • Decision Trees also perform well, especially with feature engineering (F1 up to 0.90, precision up to 0.87).
  • Feature Efficiency:
    • Tree-based models with feature selection or engineered features (e.g., "Core + Burnout", "feature selection") often match or outperform base models with fewer features.

3. Feature Selection & Engineering¶

  • Effectiveness:
    • Models using feature selection or engineered features (interactions, binning, flags) often achieve similar or better performance with fewer features.
    • This reduces model complexity and improves interpretability without sacrificing accuracy, recall, or F1.

4. Interpretability vs. Performance¶

  • Trade-off:
    • Logistic regression models are more interpretable and, with feature engineering, are now much more competitive in recall and accuracy.
    • Tree-based models remain top performers for F1 and precision, but at the cost of increased complexity.

Conclusion:

  • Feature selection and engineering are highly effective, enabling simpler models (especially logistic regression) to achieve strong recall and competitive accuracy.
  • Tree-based models (especially XGBoost) remain the best for F1 and precision, but logistic regression is now a viable, interpretable alternative for high-recall use cases.
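The model-based feature selection credited above can be sketched with scikit-learn's SelectFromModel; this is a minimal illustration on synthetic data (the dataset, threshold, and estimator here are stand-ins, not the notebook's actual configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the 18-feature HR matrix
X, y = make_classification(n_samples=1000, n_features=18,
                           n_informative=6, random_state=42)

# Keep only features whose random-forest importance meets the median importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median",
)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # about half of the original 18 features survive
```

Trading a few features for near-identical metrics is exactly the pattern the summary above describes.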
--- Sorted by f1 (descending) ---
model recall f1 roc_auc precision accuracy features best_params cv_best_score conf_matrix search_time
15 XGBoost with Interaction 0.934087 0.917952 0.983251 0.902365 0.972269 22 {'model__subsample': 0.8, 'model__scale_pos_we... 0.934077 [[7838, 161], [105, 1488]] 28.742721
19 XGBoost with Interaction (feature selection) 0.931576 0.917465 0.983235 0.903776 0.972164 10 {'model__subsample': 1.0, 'model__scale_pos_we... 0.931569 [[7841, 158], [109, 1484]] 17.935125
26 Random Forest with Binning (feature selection) 0.928437 0.916641 0.974975 0.905141 0.971956 14 {'model__n_estimators': 300, 'model__min_sampl... 0.928432 [[7844, 155], [114, 1479]] 78.624338
27 Random Forest with Binning 0.927809 0.916305 0.974756 0.905083 0.971852 26 {'model__n_estimators': 300, 'model__min_sampl... 0.927805 [[7844, 155], [115, 1478]] 104.340903
13 XGBoost with Flags (feature selection) 0.935342 0.914391 0.983656 0.894358 0.970913 9 {'model__subsample': 1.0, 'model__scale_pos_we... 0.935334 [[7823, 176], [103, 1490]] 15.118324
--- Sorted by accuracy (descending) ---
model recall f1 roc_auc precision accuracy features best_params cv_best_score conf_matrix search_time
15 XGBoost with Interaction 0.934087 0.917952 0.983251 0.902365 0.972269 22 {'model__subsample': 0.8, 'model__scale_pos_we... 0.934077 [[7838, 161], [105, 1488]] 28.742721
19 XGBoost with Interaction (feature selection) 0.931576 0.917465 0.983235 0.903776 0.972164 10 {'model__subsample': 1.0, 'model__scale_pos_we... 0.931569 [[7841, 158], [109, 1484]] 17.935125
26 Random Forest with Binning (feature selection) 0.928437 0.916641 0.974975 0.905141 0.971956 14 {'model__n_estimators': 300, 'model__min_sampl... 0.928432 [[7844, 155], [114, 1479]] 78.624338
27 Random Forest with Binning 0.927809 0.916305 0.974756 0.905083 0.971852 26 {'model__n_estimators': 300, 'model__min_sampl... 0.927805 [[7844, 155], [115, 1478]] 104.340903
13 XGBoost with Flags (feature selection) 0.935342 0.914391 0.983656 0.894358 0.970913 9 {'model__subsample': 1.0, 'model__scale_pos_we... 0.935334 [[7823, 176], [103, 1490]] 15.118324
--- Sorted by roc_auc (descending) ---
model recall f1 roc_auc precision accuracy features best_params cv_best_score conf_matrix search_time
3 XGBoost (base) 0.936598 0.910589 0.986294 0.885986 0.969454 18 {'model__subsample': 1.0, 'model__scale_pos_we... 0.936592 [[7807, 192], [101, 1492]] 21.777312
14 XGBoost with Flags 0.934714 0.911819 0.985902 0.890018 0.969975 21 {'model__subsample': 1.0, 'model__scale_pos_we... 0.934709 [[7815, 184], [104, 1489]] 21.695929
10 XGBoost with Binning 0.936598 0.906165 0.985069 0.877647 0.967786 26 {'model__subsample': 1.0, 'model__scale_pos_we... 0.936594 [[7791, 208], [101, 1492]] 26.615765
11 XGBoost with Binning (feature selection) 0.935970 0.911648 0.984534 0.888558 0.969871 14 {'model__subsample': 1.0, 'model__scale_pos_we... 0.935965 [[7812, 187], [102, 1491]] 18.453887
32 XGBoost (Core + Burnout) 0.936598 0.905615 0.983741 0.876616 0.967577 7 {'model__subsample': 1.0, 'model__scale_pos_we... 0.936592 [[7789, 210], [101, 1492]] 15.508418
--- Sorted by precision (descending) ---
model recall f1 roc_auc precision accuracy features best_params cv_best_score conf_matrix search_time
26 Random Forest with Binning (feature selection) 0.928437 0.916641 0.974975 0.905141 0.971956 14 {'model__n_estimators': 300, 'model__min_sampl... 0.928432 [[7844, 155], [114, 1479]] 78.624338
27 Random Forest with Binning 0.927809 0.916305 0.974756 0.905083 0.971852 26 {'model__n_estimators': 300, 'model__min_sampl... 0.927805 [[7844, 155], [115, 1478]] 104.340903
19 XGBoost with Interaction (feature selection) 0.931576 0.917465 0.983235 0.903776 0.972164 10 {'model__subsample': 1.0, 'model__scale_pos_we... 0.931569 [[7841, 158], [109, 1484]] 17.935125
15 XGBoost with Interaction 0.934087 0.917952 0.983251 0.902365 0.972269 22 {'model__subsample': 0.8, 'model__scale_pos_we... 0.934077 [[7838, 161], [105, 1488]] 28.742721
13 XGBoost with Flags (feature selection) 0.935342 0.914391 0.983656 0.894358 0.970913 9 {'model__subsample': 1.0, 'model__scale_pos_we... 0.935334 [[7823, 176], [103, 1490]] 15.118324
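The four sorted views above can all be produced from one results DataFrame; a minimal pandas sketch (the toy rows echo a few entries from the tables, with most columns omitted):

```python
import pandas as pd

# Toy results table mirroring a subset of the summary columns above
results = pd.DataFrame({
    "model": ["XGBoost (base)", "Random Forest with Binning", "Logistic Regression"],
    "f1": [0.911, 0.916, 0.667],
    "accuracy": [0.969, 0.972, 0.897],
    "roc_auc": [0.986, 0.975, 0.948],
    "precision": [0.886, 0.905, 0.510],
})

# One sorted leaderboard per metric, descending, as in the printouts above
for metric in ["f1", "accuracy", "roc_auc", "precision"]:
    print(f"--- Sorted by {metric} (descending) ---")
    print(results.sort_values(metric, ascending=False).head())
```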

Logistic Regression Top Pick:

Logistic Regression (Core + Interactions)

  • Recall: 0.962 (highest among all models)
  • F1: 0.667 (moderate)
  • Precision: 0.51 (lower, but expected with high recall)
  • Features: 6 (very simple, highly interpretable)
  • Why: Achieves the highest recall, which is critical for identifying as many at-risk employees as possible, and uses only 6 features, making it easy to explain to HR and stakeholders. The slightly lower F1 and precision are a common trade-off when maximizing recall. A good fit for organizations prioritizing interpretability and proactive retention.

Alternative:

Logistic Regression with Interaction (feature selection)

  • Recall: 0.960 (very close to the top)
  • F1: 0.671 (slightly higher)
  • Precision: 0.52 (slightly higher)
  • Features: 10 (still simple)
  • Why: Slightly better F1 and precision, with a small drop in recall. Still interpretable and efficient.
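How interaction terms enter a logistic regression pipeline can be sketched with PolynomialFeatures; this is an illustrative setup on synthetic data (the feature count, C value, and data are assumptions, not the notebook's exact configuration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for a 6-feature core set
X, y = make_classification(n_samples=1000, n_features=6, random_state=42)

# Pairwise interactions only -- no squared terms, no bias column
pipe = Pipeline([
    ("interactions", PolynomialFeatures(interaction_only=True, include_bias=False)),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(class_weight="balanced", C=0.1, max_iter=1000)),
])
pipe.fit(X, y)
print(round(pipe.score(X, y), 3))
```

The interaction step expands 6 base features to 21 columns while the final model stays a single, interpretable linear layer.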


Decision Tree Top Pick:

Decision Tree (Core + Burnout)

  • Recall: 0.944 (highest among DTs)
  • F1: 0.814
  • Precision: 0.72
  • Features: 7
  • Depth: 5 → interpretable
  • Why: Best recall among decision trees, with strong F1 and precision. A simple model that is easy to visualize and explain; the relatively shallow depth helps interpretability while still capturing key non-linear relationships (e.g., burnout).
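The shallow depth is what makes this model easy to share; a sketch of rendering a depth-5 tree as text with export_text (the data and feature names below are placeholders for the 7 core features):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the 7-feature core set
X, y = make_classification(n_samples=500, n_features=7, random_state=0)
feature_names = [f"feat_{i}" for i in range(7)]  # placeholder names

tree = DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=0)
tree.fit(X, y)

# Text rendering of the fitted rules -- handy for non-technical stakeholders
print(export_text(tree, feature_names=feature_names, max_depth=2))
```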

Alternative:

Decision Tree (base)

  • Recall: 0.942
  • F1: 0.803
  • Precision: 0.70
  • Features: 18
  • Why: Slightly lower recall and more features, but still interpretable. Deeper and less parsimonious; useful if you want to see the effect of all variables.


Random Forest Top Pick:

Random Forest (Core + Burnout)

  • Recall: 0.941 (highest among RFs)
  • F1: 0.837 (best outside the XGBoost models)
  • Precision: 0.75
  • Features: 7
  • Max Depth: 5 (controlled complexity)
  • Why: Best recall and F1 among random forests, with a compact feature set. The limited features and shallow trees improve generalizability, balancing predictive power and interpretability. Efficient for deployment.

Alternative:

Random Forest (base)

  • Recall: 0.940
  • F1: 0.770
  • Precision: 0.65
  • Features: 18
  • Why: Slightly lower recall and F1, but includes all features; useful for feature-importance analysis.


XGBoost Top Pick:

XGBoost (base)

  • Recall: 0.937 (highest among XGBs)
  • F1: 0.911 (highest among the base models)
  • Precision: 0.89
  • ROC AUC: 0.986 (highest overall)
  • Features: 18
  • Why: Best balance of recall, F1, precision, and ROC AUC; excellent for minimizing both false negatives and false positives. Slightly more complex, but worth it for the performance.

Alternative:

XGBoost (Core + Burnout)

  • Recall: 0.937 (same as base)
  • F1: 0.906
  • Precision: 0.88
  • Features: 7
  • Why: Nearly identical recall with slightly lower F1/precision, but much simpler. A good choice if you want a more interpretable XGBoost model.

Other alternates:

  • XGBoost with Binning: recall tied with the base model, more compact engineered features
  • XGBoost with Flags (feature selection): best interpretability, with only 9 features (F1 = 0.914, recall = 0.935)

If interpretability or runtime matters more than a slight edge in F1, pick the XGB with Flags.
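The scale_pos_weight value visible in the tuned XGBoost parameters above is conventionally set to the ratio of negative to positive examples. A minimal calculation, using approximate class counts read from the confusion matrices above (about 8,000 stayed vs. 1,600 left on the test set):

```python
# Approximate test-set class counts from the confusion matrices above
n_stayed = 7999   # negative class: employee stayed (e.g., 7838 + 161)
n_left = 1593     # positive class: employee left (e.g., 105 + 1488)

# Common heuristic: weight the positive class by the imbalance ratio,
# so misclassified leavers cost roughly as much as all the stayers
scale_pos_weight = n_stayed / n_left
print(round(scale_pos_weight, 2))  # ~5.02
```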

Total execution time: 1585.16 seconds (00:26:25)

pacE: Execute Stage¶

Back to top

  • Interpret model performance and results
  • Share actionable steps with stakeholders

I passed the point of diminishing returns long ago.

But I learned a lot of foundational material about the model construction process (pipelines, cross-validation, random search vs. grid search, checking misclassification errors, feature selection and engineering, etc.), and I did improve the logistic regression model a bit, so I'll call it a win. The time spent will make me a lot quicker next time. Nothing like mastering the basics.
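The pipeline-plus-random-search pattern mentioned above can be condensed into a few lines; this sketch uses an illustrative estimator, grid, and synthetic data rather than the notebook's exact setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the HR data
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# Pipeline step named "model" so params use the "model__" prefix seen above
pipe = Pipeline([("model", RandomForestClassifier(random_state=1))])
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "model__n_estimators": [100, 300, 500],
        "model__min_samples_leaf": [1, 2, 5],
    },
    n_iter=5, scoring="recall", cv=3, random_state=1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Random search samples a handful of the grid's combinations, trading exhaustiveness for much lower search time, which is why the search_time column above varies so widely between models.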

✏

Recall evaluation metrics¶

  • AUC is the area under the ROC curve; it's also considered the probability that the model ranks a random positive example more highly than a random negative example.
  • Precision measures the proportion of data points predicted as True that are actually True, in other words, the proportion of positive predictions that are true positives.
  • Recall measures the proportion of data points that are predicted as True, out of all the data points that are actually True. In other words, it measures the proportion of positives that are correctly classified.
  • Accuracy measures the proportion of data points that are correctly classified.
  • F1-score is the harmonic mean of precision and recall.
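These definitions map directly onto scikit-learn helpers; a small worked example on toy labels (3 true positives, 1 false positive, 1 false negative, 3 true negatives):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]
y_score = [0.1, 0.6, 0.2, 0.9, 0.8, 0.4, 0.7, 0.3]  # predicted probabilities

print("precision:", precision_score(y_true, y_pred))  # 3 TP / 4 predicted positive = 0.75
print("recall:", recall_score(y_true, y_pred))        # 3 TP / 4 actual positive = 0.75
print("accuracy:", accuracy_score(y_true, y_pred))    # 6 correct of 8 = 0.75
print("f1:", f1_score(y_true, y_pred))                # harmonic mean of P and R = 0.75
print("roc_auc:", roc_auc_score(y_true, y_score))     # AUC uses scores, not hard labels
```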

💭

Reflect on these questions as you complete the executing stage.¶

  • What key insights emerged from your model(s)?
  • What business recommendations do you propose based on the models built?
  • What potential recommendations would you make to your manager/company?
  • Do you think your model could be improved? Why or why not? How?
  • Given what you know about the data and the models you were using, what other questions could you address for the team?
  • What resources do you find yourself using as you complete this stage? (Make sure to include the links.)
  • Do you have any ethical considerations in this stage?

Execute Stage Reflection¶

What key insights emerged from your model(s)?¶

  • Satisfaction level and workload (number of projects, monthly hours) are the strongest predictors of attrition.
  • Two main at-risk groups: overworked/burned-out employees (many projects, long hours, low satisfaction) and underworked/disengaged employees (few projects, low satisfaction).
  • Tenure is important: attrition peaks at 4–5 years, then drops sharply.
  • Salary, department, and recent promotions have minimal predictive value.
  • Tree-based models (Random Forest, XGBoost) achieved the best balance of recall, precision, and F1. With feature engineering, logistic regression became competitive and highly interpretable.

What business recommendations do you propose based on the models built?¶

  • Monitor satisfaction and workload: Regularly survey employees and track workload to identify those at risk of burnout or disengagement.
  • Targeted retention efforts: Focus on employees with low satisfaction and extreme workloads, especially those at the 4–5 year tenure mark.
  • Promotions and recognition: Consider more frequent recognition or advancement opportunities.
  • Work-life balance: Encourage reasonable project loads and monthly hours to reduce burnout risk.

What potential recommendations would you make to your manager/company?¶

  • Implement early warning systems using the model to flag at-risk employees for supportive HR outreach.
  • Review workload distribution and ensure fair, manageable assignments.
  • Conduct stay interviews with employees approaching 4–5 years of tenure.
  • Communicate transparently about how predictive models are used, emphasizing support rather than punitive action.
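An early-warning flag of the kind recommended above can be derived from predicted probabilities; a hedged sketch in which the model, the 0.5 threshold, and the data are all illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for employee features
X, y = make_classification(n_samples=300, n_features=6, random_state=7)
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Flag employees whose predicted attrition probability exceeds a chosen threshold;
# lowering the threshold trades precision for recall (more employees flagged)
risk = model.predict_proba(X)[:, 1]
flagged = np.where(risk >= 0.5)[0]
print(f"{len(flagged)} of {len(X)} employees flagged for supportive outreach")
```

In practice the threshold would be chosen with HR, balancing outreach capacity against the cost of missing an at-risk employee.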

Do you think your model could be improved? Why or why not? How?¶

  • Feature engineering: Further refine interaction terms or add time-based features if available.
  • External data: Incorporate additional data (e.g., engagement surveys, manager ratings, exit interview themes).
  • Model calibration: Regularly retrain and calibrate the model as new data becomes available.
  • Bias audits: Routinely check for bias across demographic groups.
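The calibration step suggested above can be sketched with scikit-learn's CalibratedClassifierCV, which adjusts predicted probabilities to better match observed frequencies (the base model, method, and data here are illustrative):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for retraining data
X, y = make_classification(n_samples=500, n_features=8, random_state=3)

# Isotonic calibration over 3 cross-validation folds wrapped around the base model
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=3),
    method="isotonic", cv=3,
)
calibrated.fit(X, y)
probs = calibrated.predict_proba(X)[:, 1]
print(round(probs.min(), 3), round(probs.max(), 3))
```

Calibrated probabilities matter here because HR would act on the scores themselves (e.g., "70% risk"), not just on the hard predictions.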

Given what you know about the data and the models you were using, what other questions could you address for the team?¶

  • What are the specific reasons for attrition in different departments or roles?
  • Are there seasonal or cyclical patterns in attrition?
  • How do external factors (e.g., economic conditions, industry trends) affect turnover?
  • What interventions are most effective for retaining at-risk employees?

What resources do you find yourself using as you complete this stage? (Make sure to include the links.)¶

  • pandas documentation
  • matplotlib documentation
  • seaborn documentation
  • scikit-learn documentation
  • XGBoost documentation
  • Kaggle HR Analytics Dataset

Do you have any ethical considerations in this stage?¶

  • Data privacy: Ensure employee data is kept confidential and secure.
  • Fairness: Avoid using the model to unfairly target or penalize specific groups.
  • Transparency: Clearly communicate how predictions are generated and used.
  • Supportive use: Use predictions to offer support and resources, not for punitive measures.
  • Ongoing monitoring: Regularly audit the model for bias and unintended consequences.

Results and Evaluation¶

Back to top

  • Interpret model
  • Evaluate model performance using metrics
  • Prepare results, visualizations, and actionable steps to share with stakeholders

Summary of model results¶

To be completed after running X_test through the final model.
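That final evaluation follows the standard held-out-test pattern; a hedged sketch with a synthetic split standing in for the notebook's actual X_test/y_test:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the HR data and its stratified train/test split
X, y = make_classification(n_samples=1000, n_features=10, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=5)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate exactly once on the held-out test set -- never used during tuning
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```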

Conclusion, Recommendations, Next Steps¶

Conclusion¶

  • Satisfaction level and workload (number of projects, monthly hours) are the strongest predictors of employee attrition.
  • Two main at-risk groups emerged: overworked/burned-out employees (many projects, long hours, low satisfaction) and underworked/disengaged employees (few projects, low satisfaction).
  • Tenure is important: attrition peaks at 4–5 years, then drops sharply.
  • Salary, department, and recent promotions have minimal predictive value.
  • Tree-based models (Random Forest, XGBoost) achieved the best balance of recall, precision, and F1. With feature engineering, logistic regression became competitive and highly interpretable.

Recommendations¶

  • Monitor satisfaction and workload: Regularly survey employees and track workload to identify those at risk of burnout or disengagement.
  • Targeted retention efforts: Focus on employees with low satisfaction and extreme workloads, especially those at the 4–5 year tenure mark.
  • Promotions and recognition: Consider more frequent recognition or advancement opportunities.
  • Work-life balance: Encourage reasonable project loads and monthly hours to reduce burnout risk.
  • Implement early warning systems: Use the model to flag at-risk employees for supportive HR outreach.
  • Review workload distribution: Ensure fair, manageable assignments.
  • Conduct stay interviews: Engage employees approaching 4–5 years of tenure.
  • Communicate transparently: Clearly explain how predictive models are used, emphasizing support rather than punitive action.

Next Steps¶

  • Model deployment: Integrate the predictive model into HR processes for early identification of at-risk employees.
  • Continuous improvement: Regularly retrain and calibrate the model as new data becomes available.
  • Expand data sources: Incorporate additional data (e.g., engagement surveys, manager ratings, exit interview themes) to improve model accuracy.
  • Bias and fairness audits: Routinely check for bias across demographic groups and monitor for unintended consequences.
  • Ethical safeguards: Ensure employee data privacy, fairness, and transparency in all predictive analytics initiatives.

Resources Used:

  • pandas documentation
  • matplotlib documentation
  • seaborn documentation
  • scikit-learn documentation
  • XGBoost documentation
  • Kaggle HR Analytics Dataset
